In this project we will explores the univariate, bivariate and multivariate relationships between variables in Red Wine Quality Dataset using techniques in R.
This dataset is public available for research. It is a cortesy of: P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties.
Available at: - [@Elsevier] http://dx.doi.org/10.1016/j.dss.2009.05.016
Input variables (based on physicochemical tests):
Fixed acidity: most acids involved with wine or fixed or nonvolatile (do not evaporate readily)
Volatile acidity: the amount of acetic acid in wine, which at too high of levels can lead to an unpleasant, vinegar taste
Citric acid: found in small quantities, citric acid can add ‘freshness’ and flavor to wines
Residual sugar: the amount of sugar remaining after fermentation stops, it’s rare to find wines with less than 1 gram/liter and wines with greater than 45 grams/liter are considered sweet
Chlorides: the amount of salt in the wine
Free sulfur dioxide: the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine
Total sulfur dioxide: amount of free and bound forms of S02; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine
Density: the density of water is close to that of water depending on the percent alcohol and sugar content
pH: describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale
Sulphates: a wine additive which can contribute to sulfur dioxide gas (S02) levels, wich acts as an antimicrobial and antioxidant
Alcohol: the percent alcohol content of the wine
Output variables:
## X fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1 1 7.4 0.70 0.00 1.9 0.076
## 2 2 7.8 0.88 0.00 2.6 0.098
## 3 3 7.8 0.76 0.04 2.3 0.092
## 4 4 11.2 0.28 0.56 1.9 0.075
## 5 5 7.4 0.70 0.00 1.9 0.076
## 6 6 7.4 0.66 0.00 1.8 0.075
## free.sulfur.dioxide total.sulfur.dioxide density pH sulphates alcohol
## 1 11 34 0.9978 3.51 0.56 9.4
## 2 25 67 0.9968 3.20 0.68 9.8
## 3 15 54 0.9970 3.26 0.65 9.8
## 4 17 60 0.9980 3.16 0.58 9.8
## 5 11 34 0.9978 3.51 0.56 9.4
## 6 13 40 0.9978 3.51 0.56 9.4
## quality
## 1 5
## 2 5
## 3 5
## 4 6
## 5 5
## 6 5
Let’s take a look in our dataset and get some important information about the variables, the structure and schema of the data.
## [1] 1599 13
## [1] "X" "fixed.acidity" "volatile.acidity"
## [4] "citric.acid" "residual.sugar" "chlorides"
## [7] "free.sulfur.dioxide" "total.sulfur.dioxide" "density"
## [10] "pH" "sulphates" "alcohol"
## [13] "quality"
## 'data.frame': 1599 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1.0 Min. : 4.60 Min. :0.1200 Min. :0.000
## 1st Qu.: 400.5 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090
## Median : 800.0 Median : 7.90 Median :0.5200 Median :0.260
## Mean : 800.0 Mean : 8.32 Mean :0.5278 Mean :0.271
## 3rd Qu.:1199.5 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420
## Max. :1599.0 Max. :15.90 Max. :1.5800 Max. :1.000
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.900 Min. :0.01200 Min. : 1.00
## 1st Qu.: 1.900 1st Qu.:0.07000 1st Qu.: 7.00
## Median : 2.200 Median :0.07900 Median :14.00
## Mean : 2.539 Mean :0.08747 Mean :15.87
## 3rd Qu.: 2.600 3rd Qu.:0.09000 3rd Qu.:21.00
## Max. :15.500 Max. :0.61100 Max. :72.00
## total.sulfur.dioxide density pH sulphates
## Min. : 6.00 Min. :0.9901 Min. :2.740 Min. :0.3300
## 1st Qu.: 22.00 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500
## Median : 38.00 Median :0.9968 Median :3.310 Median :0.6200
## Mean : 46.47 Mean :0.9967 Mean :3.311 Mean :0.6581
## 3rd Qu.: 62.00 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300
## Max. :289.00 Max. :1.0037 Max. :4.010 Max. :2.0000
## alcohol quality
## Min. : 8.40 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.20 Median :6.000
## Mean :10.42 Mean :5.636
## 3rd Qu.:11.10 3rd Qu.:6.000
## Max. :14.90 Max. :8.000
After exploration, we can do some observations:
1. Large range in free.sulfur.dioxide and total.sulfur.dioxide (maybe outliers)
2. There are more ph and alcohol in the 3rd quartile
3. High median and mean in pH
4. In general, there is a low residual.sugar. However, max is too high (maybe outlier) 5. Quality doesn’t have a value greater than 8
6. As we see, quality is an ordinal categorical variable and we can create a new variable, according the note, assign labels.
7. According variable descriptions, it appears that fixed.acidity ~ volatile.acidity and free.sulfur.dioxide ~ total.sulfur.dioxide may possible by dependent.
This is a section that contains some common functions will be used during this analysis.
## Warning: Removed 132 rows containing non-finite values (stat_bin).
The Red Wine Dataset had 1599 observations with 13 variables. All the variables are numerical, expect for quality, which is an categorical and rated it between 0 (bad) and 10 (excellent).
We want to analyse the quality, so quality is the main feature of interest.
I believe acidity, alcohol, density and pH could affect the wine quality.
Yes, we changed quality to an ordered factor and created a new variable called “quality_label” classifies wines as bad, good or very good based on quality.
This correlation matrix shows quality has : - alcohol (positive correlation) - sulphates (positive correlation) - citric.acid (positive correlation) - volatile.acidity (negative correlation) - total.sulphur.dioxide (negative correlation) - density (negative correlation)
So, a ‘very good’ wine usually has:
Let’s investigate how above chemical properties affect quality of wine.
##
## Pearson's product-moment correlation
##
## data: wine_ds$alcohol and as.numeric(wine_ds$quality)
## t = 21.639, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.4373540 0.5132081
## sample estimates:
## cor
## 0.4761663
##
## Pearson's product-moment correlation
##
## data: wine_ds$sulphates and as.numeric(wine_ds$quality)
## t = 10.38, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.2049011 0.2967610
## sample estimates:
## cor
## 0.2513971
##
## Pearson's product-moment correlation
##
## data: wine_ds$citric.acid and as.numeric(wine_ds$quality)
## t = 9.2875, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.1793415 0.2723711
## sample estimates:
## cor
## 0.2263725
We see that alcohol, sulphates, and citric.acid are positively correlated to quality of wine, but there are some outliers on the higher end of alcohol and sulphates for the wine of rating 5 for the quality. This says there might be other factors which decide the quality of the wine.
##
## Pearson's product-moment correlation
##
## data: wine_ds$volatile.acidity and as.numeric(wine_ds$quality)
## t = -16.954, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.4313210 -0.3482032
## sample estimates:
## cor
## -0.3905578
##
## Pearson's product-moment correlation
##
## data: wine_ds$total.sulfur.dioxide and as.numeric(wine_ds$quality)
## t = -7.5271, df = 1597, p-value = 8.622e-14
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.2320162 -0.1373252
## sample estimates:
## cor
## -0.1851003
##
## Pearson's product-moment correlation
##
## data: wine_ds$density and as.numeric(wine_ds$quality)
## t = -7.0997, df = 1597, p-value = 1.875e-12
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.2220365 -0.1269870
## sample estimates:
## cor
## -0.1749192
We see that volatile.acidity, total.sulphur.dioxide and density are inversely correlated to quality of the wine but there are some outliers which shows that there are other factors which decide the quality of the wine.
bivariate_scatterplot(wine_ds$citric.acid, wine_ds$fixed.acidity, 'citric.acid', 'fixed.acidity')
cor.test(wine_ds$citric.acid, as.numeric(wine_ds$fixed.acidity))
##
## Pearson's product-moment correlation
##
## data: wine_ds$citric.acid and as.numeric(wine_ds$fixed.acidity)
## t = 36.234, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.6438839 0.6977493
## sample estimates:
## cor
## 0.6717034
## fixed.acidity volatile.acidity citric.acid
## 0.12405165 -0.39055778 0.22637251
## residual.sugar chlordies free.sulfur.dioxide
## 0.01373164 -0.12890656 -0.05065606
## total.sulfur.dioxide density pH
## -0.18510029 -0.17491923 -0.05773139
## sulphates alcohol
## 0.25139708 0.47616632
## 3 4 5 6 7 8
## 10 53 681 638 199 18
This graph explains the most of wines are rated of quality 5 and 6.
## Scale for 'y' is already present. Adding another scale for 'y', which
## will replace the existing scale.
To understand what chemical properites affect their quality, we started by exploring the relationship individual variables with quality and seeing which ones correlated most highly with the quality rating. Based on these findings, I explored the data further, concentrating on the effect of alcohol, volatile acidity, sulphates content and citric acid. My findings showed that most good red wines have a high alcohol, sulphate and citric acid levels and low volatile acidity.
Red wine dataset contains information on 1599 Portuguese Red Wines, limitating the analysis from a specific portugal region. It will be intresting to obtain datasets across various wine making regions to eliminate any bias created by any specific qualities of the product.
Further work could be carried out to build a model to predict the quality of red wine.